30 research outputs found
Use of Weighted Finite State Transducers in Part of Speech Tagging
This paper addresses issues in part-of-speech disambiguation using
finite-state transducers and presents two main contributions to the field. The
first is the use of finite-state machines for part-of-speech tagging:
linguistic and statistical information is represented in terms of weights on
transitions in weighted finite-state transducers. The second is the
successful combination of techniques -- linguistic and statistical -- for word
disambiguation, compounded with the notion of word classes.
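The core idea of weighting transitions can be illustrated with a toy sketch. This is not the paper's actual transducer: the tags, weights, and the two tables below are invented for illustration, with weights read as negative log probabilities so that the best tag sequence is the minimum-weight path through the machine.

```python
import math

# Hypothetical toy model: transition and emission weights stand in for the
# arcs of a weighted finite-state transducer; lower total weight is better.
TRANS = {  # weight of moving from one tag to the next
    ("<s>", "DET"): 0.5, ("<s>", "NOUN"): 1.5,
    ("DET", "NOUN"): 0.3, ("DET", "DET"): 3.0,
    ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 1.8,
    ("VERB", "DET"): 0.7, ("VERB", "NOUN"): 1.2,
}
EMIT = {  # weight of a tag emitting a word (lexicon arcs)
    ("the", "DET"): 0.1,
    ("dog", "NOUN"): 0.4,
    ("barks", "VERB"): 0.5, ("barks", "NOUN"): 1.9,
}

def tag(words):
    """Return the minimum-weight tag path for a sentence."""
    paths = {("<s>",): 0.0}  # partial tag path -> accumulated weight
    for w in words:
        nxt = {}
        for path, cost in paths.items():
            for (word, t), e in EMIT.items():
                if word != w:
                    continue
                trans = TRANS.get((path[-1], t))
                if trans is None:
                    continue
                cand = cost + trans + e
                key = path + (t,)
                if cand < nxt.get(key, math.inf):
                    nxt[key] = cand
        paths = nxt
    best = min(paths, key=paths.get)
    return list(best[1:])  # drop the start symbol
```

In a real weighted transducer the same effect is obtained by composing a lexicon transducer with a tag-sequence model and taking the shortest path.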
GIST-IT: Summarizing Email Using Linguistic Knowledge and Machine Learning
We present a system for the automatic extraction of salient information from email messages, thus providing the gist of their meaning. Dealing with email raises several challenges that we address in this paper, notably data that are heterogeneous in length and topic. Our method combines shallow linguistic processing with machine learning to extract phrasal units that are representative of email content. The GIST-IT application is fully implemented and embedded in an active mailbox platform. Evaluation was performed over three machine-learning paradigms.
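The extraction step can be sketched in miniature. This is not the GIST-IT pipeline itself: the function below is a hypothetical stand-in that treats stopword-free bigrams as candidate phrasal units (where the real system uses shallow linguistic chunking) and ranks them by raw frequency (where the real system uses trained models).

```python
import re
from collections import Counter

def extract_gist(text, stopwords, top_n=3):
    """Toy salience ranking: pull out candidate phrasal units and
    return the most frequent ones as the 'gist' of the message."""
    words = re.findall(r"[a-z']+", text.lower())
    # Bigrams containing no stopword stand in for shallow NP chunks.
    cands = [
        (a, b) for a, b in zip(words, words[1:])
        if a not in stopwords and b not in stopwords
    ]
    return [" ".join(c) for c, _ in Counter(cands).most_common(top_n)]
```

A frequency ranking like this is only a baseline; the abstract's point is precisely that learned models outperform such surface statistics.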
The setting is a small Mississippi River town in the 1830s, and the characters are the children and grown-ups of the town. Tom Sawyer is the main character, and you follow him around during the book. (Kuiper 1122) This book was largely based on Mark Twain's boyhood. The famous whitewashing scene actually happened: Mark was Tom, getting other little boys to do his work. He also got lost in that very same cave. Huck Finn was the same way; Huck was based on Twain's boyhood friend and "idol", Tom Blankenship. Tom Sawyer, the main character of the work, is hardly the "model boy". He is just like any other boy, mischievous and irresponsible, yet goodhearted. He reminds us all of how we used to be at that age. We did whatever we could to have fun. He is a thirteen-year-old boy filled with adventure and excitement.
Issues In Text-To-Speech For French
This paper reports the progress of the French text-to-speech system being developed at AT&T Bell Laboratories as part of a larger project for multilingual text-to-speech systems, including languages such as Spanish, Italian, German, Russian, and Chinese. These systems, based on diphone and triphone concatenation, follow the general framework of the Bell Laboratories English TTS system [?], [?]. This paper provides a description of the approach, the current status of the French text-to-speech project, and some problems particular to French.
Speech synthesis and natural language processing (La synthèse de la parole et le traitement automatique des langues)
Information Retrieval Based on Context Distance and Morphology
We present an approach to information retrieval based on context distance and morphology. Context distance is a measure we use to assess the closeness of word meanings. This context distance model measures semantic distances between words using the local contexts of words within a single document as well as the lexical co-occurrence information in the set of documents to be retrieved. We also propose to integrate the context distance model with morphological analysis in determining word similarity, so that the two can enhance each other. Using the standard vector-space model, we evaluated the proposed method on a subset of the TREC-4 corpus (AP88 and AP90 collections, 158,240 documents, 49 queries). Results show that this method improves the 11-point average precision by 8.6%.
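The local-context idea can be sketched as follows. This is a minimal illustration, not the paper's model: it builds a co-occurrence vector for each word from a fixed-size context window and takes one minus cosine similarity as a stand-in "context distance"; the window size and the cosine choice are assumptions.

```python
import math
from collections import defaultdict

def context_vectors(docs, window=2):
    """Build a co-occurrence vector for each word from its local contexts."""
    vecs = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        toks = doc.lower().split()
        for i, w in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if j != i:
                    vecs[w][toks[j]] += 1
    return vecs

def context_distance(vecs, a, b):
    """1 - cosine similarity of the two words' context vectors."""
    va, vb = vecs[a], vecs[b]
    dot = sum(va[k] * vb.get(k, 0) for k in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return 1.0 - (dot / (na * nb) if na and nb else 0.0)
```

Words that occur in similar local contexts ("cat" and "dog" in parallel sentences) come out closer than words that do not, which is the property the retrieval model exploits.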
Using word class for part-of-speech disambiguation
This paper presents a methodology for improving part-of-speech disambiguation using word classes. We build on earlier work for tagging French, where we showed that statistical estimates can be computed without lexical probabilities. We investigate new directions for deriving different kinds of probabilities based on paradigms of tags for given words. We base estimates not on the words themselves, but on the set of tags associated with a word. We compute frequencies of unigrams, bigrams, and trigrams of word classes in order to further refine the disambiguation. This new approach gives a more efficient representation of the data for part-of-speech disambiguation. We show empirical results to support our claim. We demonstrate that, besides providing good estimates for disambiguation, word classes solve some of the problems caused by sparse training data. We describe a part-of-speech tagger built on these principles, and we suggest a methodology for developing an adequate training corpus.
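The shift from word n-grams to word-class n-grams can be sketched concretely. In this toy version (not the paper's implementation), a word's class is the set of tags it takes anywhere in a small tagged corpus, and trigram counts are gathered over those classes instead of over the words themselves.

```python
from collections import Counter, defaultdict

def class_trigrams(tagged_sents):
    """Count trigrams of word classes (the set of tags a word can take)
    rather than trigrams of words, from a toy tagged corpus."""
    # Build each word's ambiguity class from the corpus itself.
    classes = defaultdict(set)
    for sent in tagged_sents:
        for word, t in sent:
            classes[word].add(t)
    cls = {w: frozenset(ts) for w, ts in classes.items()}
    # Count trigrams over class sequences instead of word sequences.
    grams = Counter()
    for sent in tagged_sents:
        seq = [cls[w] for w, _ in sent]
        for i in range(len(seq) - 2):
            grams[tuple(seq[i:i + 3])] += 1
    return cls, grams
```

Because many words share the same ambiguity class, counts pool across words, which is how this representation eases the sparse-data problem the abstract mentions.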
The automatic induction of concatenative units from machine readable dictionaries and corpora for speech synthesis
The purpose of this research is to determine the best method for deciding on an optimal set of concatenative units for concatenative speech synthesis. Of the two main approaches to speech synthesis, segmental synthesis and rule-based synthesis, the former relies heavily on the successful choice of concatenative units. Segmental synthesis consists of concatenating segmental units (diphones, triphones, etc.); rule-based synthesis consists of the computation of control parameters based on pre-established rules. Deciding on the set of diphones is quite straightforward in the sense that it suffices to take the phoneme inventory of a language and simply combine each phoneme with every other one. For example, taking the approximately 35 French phonemes, 1,225 phonemic pairs (35 x 35) constitute the complete and exhaustive starting diphone inventory. On the other hand, deciding on the set of triphones, quadriphones, and larger units raises difficult questions about the nature of phonemes in a given language, such as: (1) stability vs. instability in a coarticulatory environment, (2) size of the overall inventory, and (3) frequency of that unit in the language, in combination with factors (1) and (2).
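The exhaustive diphone inventory described above is just the Cartesian product of the phoneme set with itself, which reproduces the 35 x 35 = 1,225 count. The sketch below uses placeholder symbols, not the actual French phoneme inventory.

```python
def diphone_inventory(phonemes):
    """All ordered phoneme pairs: the complete, exhaustive
    starting diphone inventory for a language."""
    return [(a, b) for a in phonemes for b in phonemes]

# Placeholder symbols standing in for the ~35 French phonemes.
french_like = [f"p{i}" for i in range(35)]
```

Unlike diphones, triphone and larger inventories cannot be enumerated this way in practice, which is why frequency data from dictionaries and corpora are needed to prune them.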
We report on experiments with four different databases, with comparisons between the resources regarding their n-gram frequency output. The first two databases consist of pronunciation-field information from two dictionaries, the Encyclopedic Robert French dictionary with 85,000 headwords and the smaller Collins Gem containing 15,000 words. For comparison, we use two text corpora, the Hansard (about 2.5 million words) and the smaller Tubach and Boe corpus (80,000 words); both corpora were processed by a set of grapheme-to-phoneme rules. A frequency extraction program was applied to all four resources to extract trigram phonemic frequencies; this serves as a basis for comparison between dictionary-derived and corpus-derived frequencies.
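The frequency-extraction pass can be sketched as a simple count over phonemic transcriptions. This is an illustrative stand-in for the program described above, not its actual code; the example transcriptions are invented.

```python
from collections import Counter

def trigram_frequencies(phonemic_words):
    """Count phoneme trigrams across a list of phonemic transcriptions
    (each transcription is a sequence of phoneme symbols)."""
    counts = Counter()
    for word in phonemic_words:
        for i in range(len(word) - 2):
            counts[tuple(word[i:i + 3])] += 1
    return counts
```

Running the same count over dictionary pronunciation fields and over phonemized corpus text yields the two frequency profiles whose comparison the study reports.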